IMPROVED SUPPORT VECTOR MACHINE

FIELD OF THE INVENTION 

The present invention relates to a method for selecting a reduced set of support 
vectors for use during a training phase of a support vector machine. 

BACKGROUND TO THE INVENTION 

A Support Vector Machine (SVM) is a universal learning machine that, during 
a training phase, determines a decision surface or "hyperplane". The decision 
hyperplane is determined by a set of support vectors selected from a training 
population of vectors and by a set of corresponding multipliers. The decision 
hyperplane is also characterised by a kernel function. 

Subsequent to the training phase an SVM operates in a testing phase during
which it is used to classify test vectors on the basis of the decision hyperplane
previously determined during the training phase. A problem arises however as the
complexity of the computations that must be undertaken to make a decision scales
with the number of support vectors used to determine the hyperplane.

Support Vector Machines find application in many and varied fields. For
example, in an article by S. Lyu and H. Farid entitled "Detecting Hidden Messages
using Higher-Order Statistics and Support Vector Machines" (5th International
Workshop on Information Hiding, Noordwijkerhout, The Netherlands, 2002) there is a
description of the use of an SVM to discriminate between untouched and adulterated
digital images.

Alternatively, in a paper by H. Kim and H. Park entitled "Prediction of protein
relative solvent accessibility with support vector machines and long-range interaction
3d local descriptor" (Proteins: Structure, Function and Genetics, to be published)
SVMs are applied to the problem of predicting high resolution 3D structure in order to
study the docking of macro-molecules.

The mathematical basis of an SVM will now be explained. An SVM is a
learning machine that selects m random vectors $x$ drawn independently from the
probability distribution function $p(x)$. The system then returns an output value for
every input vector $x_i$, such that $y(x_i) = y_i$.

The pairs $(x_i, y_i)$, $i = 0, \ldots, m$, are referred to as the training examples. The resulting
function $y(x)$ determines the hyperplane which is then used to estimate unknown
mappings.

Figure 1 illustrates the above method. Each of steps 24, 26 and 28 of Figure
1 is well known in the prior art.

With some manipulations of the governing equations the support vector 
machine can be phrased as the following Quadratic Programming problem: 



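$$\min_{\alpha}\; W(\alpha) = \tfrac{1}{2}\,\alpha^{T}\Omega\,\alpha - \alpha^{T}e \tag{1}$$

where

$$\Omega_{ij} = y_i\,y_j\,K(x_i, x_j) \tag{2}$$

$$e = [1, 1, \ldots, 1]^{T} \tag{3}$$

subject to

$$0 = \alpha^{T}y \tag{4}$$

$$0 \le \alpha_i \le C \tag{5}$$

where $C$ is some regularization constant. (6)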
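
As a concrete illustration of the quadratic programming formulation, the following minimal Python sketch builds the matrix of equation (2) for a toy training set, evaluates the objective of equation (1) for a candidate multiplier vector, and checks constraints (4) and (5). The linear kernel, the function names and the sample data are assumptions made for illustration only; no QP solver is included.

```python
def linear_kernel(u, v):
    """Plain inner product; stands in for the kernel K of equation (2)."""
    return sum(a * b for a, b in zip(u, v))


def build_omega(xs, ys, kernel=linear_kernel):
    """Omega[i][j] = y_i * y_j * K(x_i, x_j), as in equation (2)."""
    m = len(xs)
    return [[ys[i] * ys[j] * kernel(xs[i], xs[j]) for j in range(m)]
            for i in range(m)]


def objective(alpha, omega):
    """W(alpha) = 1/2 * alpha^T Omega alpha - alpha^T e, equation (1)."""
    m = len(alpha)
    quad = sum(alpha[i] * omega[i][j] * alpha[j]
               for i in range(m) for j in range(m))
    return 0.5 * quad - sum(alpha)


def is_feasible(alpha, ys, c, tol=1e-12):
    """Constraints (4) and (5): alpha^T y = 0 and 0 <= alpha_i <= C."""
    balanced = abs(sum(a * y for a, y in zip(alpha, ys))) <= tol
    boxed = all(-tol <= a <= c + tol for a in alpha)
    return balanced and boxed


# Hypothetical toy data: one point per class.
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
omega = build_omega(xs, ys)
alpha = [0.5, 0.5]
value = objective(alpha, omega)
```

A QP solver would search over feasible multiplier vectors for the one giving the smallest objective value; the helpers above only evaluate a single candidate.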



$K(x_i, x_j)$ is the kernel function and can be viewed as a generalized inner
product of two vectors. The result of training the SVM is the determination of the
multipliers $\alpha_i$.

Suppose we train an SVM classifier with pattern vectors $x_i$, and that r of these
vectors are determined to be support vectors. Denote them by $x_i$, $i = 1, 2, \ldots, r$. The
decision hyperplane for pattern classification then takes the form

$$y(x) = \sum_{i=1}^{r} \alpha_i y_i K(x, x_i) + b \tag{7}$$

where $\alpha_i$ is the Lagrange multiplier associated with pattern $x_i$ and $K(\cdot,\cdot)$ is a
kernel function that implicitly maps the pattern vectors into a suitable feature space.
The term $b$ can be determined independently of the $\alpha_i$. Figure 2 illustrates in two
dimensions the separation of two classes by a hyperplane 30. Note that all of the x's
and o's contained within a rectangle in Figure 2 are considered to be support vectors
and would have associated non-zero $\alpha_i$.

Now suppose that support vector $x_k$ is linearly dependent on the other support
vectors in feature space, i.e.

$$K(x, x_k) = \sum_{i=1,\, i \neq k}^{r} c_i K(x, x_i) \tag{8}$$

where the $c_i$ are some scalars.

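Then the decision surface defined by equation (7) can be written as

$$y(x) = \sum_{i=1,\, i \neq k}^{r} \alpha_i y_i K(x, x_i) + \alpha_k y_k \sum_{i=1,\, i \neq k}^{r} c_i K(x, x_i) + b \tag{9}$$

Now define $\alpha_k y_k c_i = \alpha_i y_i \gamma_i$ so that (9) can be written

$$y(x) = \sum_{i=1,\, i \neq k}^{r} \alpha_i (1 + \gamma_i)\, y_i K(x, x_i) + b \tag{10}$$

$$y(x) = \sum_{i=1,\, i \neq k}^{r} \alpha'_i\, y_i K(x, x_i) + b \tag{11}$$

where

$$\alpha'_i = \alpha_i (1 + \gamma_i) \tag{12}$$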

Comparing (11) and (7) we see that the linearly dependent support vector $x_k$ is
not required in the representation of the decision surface. Note that the Lagrange
multipliers must be modified in order to obtain the simplified representation. This
process (described in T. Downs, K. E. Gates, and A. Masters, "Exact simplification
of support vector solutions", Journal of Machine Learning Research, 2:293-297, 2001)
is a successful way of reducing the support vectors after they have been calculated.
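
The simplification can be checked numerically. The minimal Python sketch below assumes a linear kernel in two dimensions, where a third support vector is necessarily a linear combination of two independent ones, and applies the multiplier update of equation (12). The function names, multipliers, labels and offset are hypothetical values chosen for illustration, not the output of a trained SVM.

```python
def linear_kernel(u, v):
    """Plain inner product used as the kernel K."""
    return sum(a * b for a, b in zip(u, v))


def decision(x, svs, alphas, ys, b, kernel=linear_kernel):
    """y(x) = sum_i alpha_i * y_i * K(x, x_i) + b, as in equation (7)."""
    return sum(a * y * kernel(x, sv) for a, y, sv in zip(alphas, ys, svs)) + b


def drop_dependent(alphas, ys, k, coeffs):
    """Remove support vector k and rescale the remaining multipliers.

    coeffs maps each index i != k to the scalar c_i of equation (8).
    gamma_i follows from alpha_k * y_k * c_i = alpha_i * y_i * gamma_i.
    """
    new_alphas = []
    for i, (a, y) in enumerate(zip(alphas, ys)):
        if i == k:
            continue
        gamma = alphas[k] * ys[k] * coeffs[i] / (a * y)
        new_alphas.append(a * (1.0 + gamma))  # equation (12)
    return new_alphas


# x_3 = x_1 + x_2, so K(x, x_3) = 1 * K(x, x_1) + 1 * K(x, x_2).
svs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [1, -1, 1]
alphas = [0.3, 0.2, 0.5]
b = 0.1

reduced_alphas = drop_dependent(alphas, ys, k=2, coeffs={0: 1.0, 1: 1.0})
x = (2.0, 3.0)
full_value = decision(x, svs, alphas, ys, b)
reduced_value = decision(x, svs[:2], reduced_alphas, ys[:2], b)
```

Both evaluations agree up to rounding, although the modified multipliers need not satisfy the original constraints (here one of them becomes negative).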

Figure 3 depicts the same hyperplane as in Figure 2, but this time the number
of support vectors has been reduced to just two vectors 32 through the process of
determining a linearly independent set of support vectors.

Given either (11) or (7) an unclassified sample vector $x$ may be classified by
calculating $y(x)$ and then returning -1 for all values less than zero and 1 for all values
greater than zero.
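
The classification rule can be sketched in a few lines of Python. The Gaussian kernel, the function names and the two support vectors below are assumptions made for illustration. The handling of a value of exactly zero is also an assumption, since the rule above does not cover that case.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * |u - v|^2); one common choice of K."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Evaluate y(x) as in equation (7) and return the class label."""
    value = sum(a * y * kernel(x, sv)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    # y(x) == 0 is not covered by the text; mapping it to -1 is an assumption.
    return 1 if value > 0 else -1


# Hypothetical support vectors, one per class.
support_vectors = [(0.0, 0.0), (2.0, 0.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
near_positive = classify((1.8, 0.0), support_vectors, alphas, labels, b=0.0)
near_negative = classify((0.1, 0.0), support_vectors, alphas, labels, b=0.0)
```

The cost of each call grows with the number of support vectors in the sum, which is why a smaller support vector set shortens the testing phase.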

Figure 4 is a flow chart of a typical method employed by prior art SVMs for
classifying an unknown vector. Steps 34 through 40 are defined in the literature and
by equations (7) or (11).

As previously alluded to, because the sets of training vectors may be very
large and the time involved to train the SVM may be excessive it would be desirable
if it were possible to undertake an a-priori reduction of the training set before the
calculation of the support vectors.

It will be realised from the above discussion that a reduced set of vectors
might be arrived at by choosing only linearly independent vectors. The determination
of the linearly independent support vectors may be undertaken by any method
commonly in use in linear algebra. Common methods would be the calculation with
pivoting of the reduced row echelon form, the QR factors or the singular value
decomposition. Any of these methods would give a set of r linearly independent
vectors that could then be used to calculate the Lagrange multipliers and a decision
surface similar to that defined by equation (7). A problem arises however in that it is
not clear how to optimally select the support vectors that will be kept in the set.
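
To make the selection problem concrete, here is a minimal Python sketch that extracts a linearly independent subset by Gram-Schmidt orthogonalisation, the procedure underlying a QR factorisation. It works in input space with plain lists rather than a library routine; the function names, tolerance and sample vectors are assumptions for illustration.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def independent_subset(vectors, tol=1e-10):
    """Return indices of a maximal linearly independent subset, in input order."""
    basis = []  # mutually orthogonal residuals of the kept vectors
    kept = []
    for idx, vec in enumerate(vectors):
        residual = list(vec)
        for q in basis:
            coef = dot(residual, q) / dot(q, q)
            residual = [r - coef * qi for r, qi in zip(residual, q)]
        # A (near) zero residual means vec lies in the span of the kept vectors.
        if dot(residual, residual) > tol:
            basis.append(residual)
            kept.append(idx)
    return kept


# Hypothetical data: four vectors spanning a two-dimensional space.
vectors = [(1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
kept = independent_subset(vectors)
```

Which vectors survive depends entirely on the order in which they are visited: presenting the same four vectors in a different order would keep a different pair. This is the ambiguity described above.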

It is an object of the present invention to provide an improved method for
selecting support vectors in a Support Vector Machine.

SUMMARY OF THE INVENTION 

According to a first aspect of the present invention there is provided a method for
operating a computational device as a support vector machine in order to define a
decision surface separating two opposing classes of a training set of vectors, the
method including the steps of:

associating a distance parameter with each vector of the training set, the
distance parameter indicating a distance from its associated vector to the opposite
class; and

determining a linearly independent set of support vectors from the training set
such that the sum of the distances associated with the linearly independent support
vectors is minimised.

The distance parameter may be the average of the distances from its
associated vector to each of the vectors in the opposite class.

Alternatively, the distance parameter may be the shortest of the distances
from its associated vector to each of the vectors in the opposite class.

In a preferred embodiment the distance parameter is calculated according to
the equation $|v - u|^2 = K(u, u) + K(v, v) - 2K(v, u)$ where $v$ and $u$ are vectors and $K$ is
a kernel function used to define the decision surface.
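
A minimal Python sketch of the distance parameter follows, covering both the shortest and the average variant. The linear kernel, the function names and the one-dimensional toy data are assumptions for illustration. The equation yields a squared distance, and the sketch works with squared values throughout, which is a simplification.

```python
def linear_kernel(u, v):
    """Plain inner product; any valid kernel K could be substituted."""
    return sum(a * b for a, b in zip(u, v))


def kernel_distance_sq(u, v, kernel=linear_kernel):
    """|v - u|^2 = K(u, u) + K(v, v) - 2 K(v, u)."""
    return kernel(u, u) + kernel(v, v) - 2.0 * kernel(v, u)


def distance_parameter(i, xs, ys, kernel=linear_kernel, mode="min"):
    """Squared distance from xs[i] to the opposite class.

    mode="min" uses the closest opposite-class vector; any other value
    averages over all opposite-class vectors.
    """
    dists = [kernel_distance_sq(xs[i], xs[j], kernel)
             for j in range(len(xs)) if ys[j] != ys[i]]
    if mode == "min":
        return min(dists)
    return sum(dists) / len(dists)


# Hypothetical training set: two points per class along one axis.
xs = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
ys = [-1, -1, 1, 1]
shortest = [distance_parameter(i, xs, ys, mode="min") for i in range(len(xs))]
average = [distance_parameter(i, xs, ys, mode="mean") for i in range(len(xs))]
```

In this example both variants rank the two inner points, at 1 and 3 on the axis, as closest to the opposite class.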

The step of determining a linearly independent set of support vectors may be
performed by using at least one of the following methods: rank revealing QR
reduction, or reduced row echelon form with pivoting on the vector having the
smallest associated distance parameter.

According to a further embodiment of the invention there is provided a
computer readable medium containing instructions for processing by a computational
device for implementing the method summarised above.

The computational device may comprise a conventional computer system
such as a personal computer.

Further preferred features of the present invention will be described in the 
following detailed description of an exemplary embodiment wherein reference will be 
made to a number of figures as follows. 

BRIEF DESCRIPTION OF THE DRAWINGS

In order that this invention may be more readily understood and put into practical 
effect, reference will now be made to the accompanying drawings which illustrate a 
typical preferred embodiment of the invention and wherein: 

Figure 1 is a flowchart depicting a training phase during implementation of a
prior art support vector machine.

Figure 2 is a diagram showing a number of support vectors on either side of a 
decision hyperplane. 

Figure 3 is a diagram showing a reduced set of support vectors on either side
of a decision hyperplane.

Figure 4 is a flowchart depicting a testing phase during implementation of a
prior art support vector machine.

Figure 5 is a flowchart depicting a training phase method according to a 
preferred embodiment of the present invention. 

Figure 6 is a block diagram of a computer system for executing a software
product according to the present invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS 

Vapnik in his book Statistical Learning Theory (Wiley, New York, 1998) has shown
that the support vector machine selects the hyperplane that minimizes the
generalization error, or at least an upper bound on it. The hyperplane with this
property is the one that leaves the maximum margin between the two classes, where
the margin is defined as the sum of the distances of the hyperplane from the closest
points of the two classes. The support vector machine works on finding the
maximum margin separating hyperplane between two subject groups through the
minimization of a given quadratic programming problem.

The present inventor has realised that given that it is desirable to find the 
maximum margin, and that we can calculate the distance between any two points in 
the test set, the optimal vectors to preselect as potential support vectors are those 
closest to the decision hyperplane. The vectors closest will be the ones with the 
minimum distance to the opposing class. 

The distance between two vectors in a plane $(u, v)$ can be defined by the
magnitude of the difference between them $|v - u|$ or

$$|v - u|^2 = |u|^2 + |v|^2 - 2\,|u|\,|v| \cos\theta \tag{13}$$

where $\theta$ is the angle between them. But

$$\cos\theta = \frac{u^{T}v}{|u|\,|v|} \tag{14}$$

so

$$|v - u|^2 = |u|^2 + |v|^2 - 2\,u^{T}v \tag{15}$$



In support vector machines the inner product is replaced by a generalized inner
product expressed by $K(v, u)$. In the mathematical language of support vector
machines equation (15) is written as:

$$|v - u|^2 = K(u, u) + K(v, v) - 2K(v, u) \tag{16}$$
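
The step from equation (15) to equation (16) can be made explicit. The short derivation below assumes that the kernel can be written as an inner product $K(u, v) = \phi(u)^{T}\phi(v)$ of a feature map $\phi$; the symbol $\phi$ is introduced here for illustration and does not appear in the original text.

```latex
\begin{aligned}
|\phi(v) - \phi(u)|^2
  &= \bigl(\phi(v) - \phi(u)\bigr)^{T}\bigl(\phi(v) - \phi(u)\bigr) \\
  &= \phi(u)^{T}\phi(u) + \phi(v)^{T}\phi(v) - 2\,\phi(v)^{T}\phi(u) \\
  &= K(u, u) + K(v, v) - 2\,K(v, u)
\end{aligned}
```

For the linear kernel $K(u, v) = u^{T}v$ the last line reduces to equation (15), so the feature-space distance can be computed from kernel evaluations alone without forming $\phi$ explicitly.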

We can define this distance in at least two ways: the average distance from a vector
to all vectors in the other class, or the shortest distance from the vector to any vector
in the other class. Both alternatives work well. Given a set of vectors of size p, the
shortest distance from each vector to the opposing class is calculated in feature
space. The vectors with the smallest distance are then selected as pivots in
the calculation of the row reduced echelon form of Gaussian elimination, the
rank-revealing QR or the SVD. The pivots are known a priori, which will make online
learning feasible for the support vector machine. Proceeding in this way, by pivoting
the vector with the smallest distance to the opposing set to the pivot position in the
rank revealing algorithm, r linearly independent vectors can be selected, as the other
p-r vectors can be considered linearly dependent on the initial r vectors. A reduced
set of linearly independent vectors to be trained in an SVM is thus arrived at. Only the
linearly independent set is used as training vectors for the quadratic programming (QP)
problem.
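
The selection procedure can be sketched as follows. This minimal Python example orders the vectors by their shortest squared kernel distance to the opposite class and then keeps each vector only if it is linearly independent, in feature space, of those already kept. Independence is tested with an incremental Cholesky factorisation of the kernel Gram matrix, which is a substitute chosen here for brevity rather than the row echelon, QR or SVD routines named above. The function names, tolerance and data are assumptions for illustration.

```python
import math


def linear_kernel(u, v):
    """Plain inner product; any valid kernel K could be substituted."""
    return sum(a * b for a, b in zip(u, v))


def select_independent(xs, ys, kernel=linear_kernel, tol=1e-9):
    """Indices of a linearly independent set, closest-to-boundary first."""
    m = len(xs)

    def dist(i):
        # Shortest squared feature-space distance to the opposite class.
        return min(kernel(xs[i], xs[i]) + kernel(xs[j], xs[j])
                   - 2.0 * kernel(xs[i], xs[j])
                   for j in range(m) if ys[j] != ys[i])

    order = sorted(range(m), key=dist)  # smallest distance pivots first
    kept, chol = [], []  # chol holds the rows of the Cholesky factor
    for c in order:
        row = []
        for j, s in enumerate(kept):
            acc = kernel(xs[c], xs[s]) - sum(row[p] * chol[j][p]
                                             for p in range(j))
            row.append(acc / chol[j][j])
        # Squared norm of the part of xs[c] outside the span of kept vectors.
        residual = kernel(xs[c], xs[c]) - sum(v * v for v in row)
        if residual > tol:
            kept.append(c)
            chol.append(row + [math.sqrt(residual)])
    return kept


# Hypothetical data: five 2-D points, so a linear kernel has rank two.
xs = [(1.0, 1.0), (3.0, 3.0), (0.0, 4.0), (-1.0, 0.0), (-4.0, -4.0)]
ys = [1, 1, 1, -1, -1]
selected = select_independent(xs, ys)
```

With this data the two vectors nearest the opposite class are retained and the remaining three are discarded as linearly dependent, so only two vectors would be passed on to the QP problem.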

Figure 5 is a flowchart of a method according to a preferred embodiment of
the invention. The procedure at step 42 is the same as step 24 in the prior art
method of Figure 1. Step 44 is also exactly the same as step 26 in the prior art
method illustrated by Figure 1. In step 46 however, the distance from each vector $x_i$
to the opposite class, $y_i \neq y_j$, is calculated using:

$$|v - u|^2 = K(u, u) + K(v, v) - 2K(v, u) \tag{17}$$

and then taking a sum of all the distances to other vectors $x_j$ where $y_i \neq y_j$ or by taking
the minimum distance to other vectors $x_j$ where $y_i \neq y_j$. In step 46 a linearly
independent set of the vectors in feature space is calculated by using any method
including the SVD, rank revealing QR or reduced row echelon form (see Golub and
van Loan or any other linear algebra text) and pivoting on the smallest distance to the
opposite class. Step 8 is identical to step 4 of the prior art and includes any solution
method for the QP problem.

A subsequent testing phase, wherein unknown vectors $x$ are classified, would
proceed according to the method depicted by the flowchart of Figure 4. Since the
training vectors derived in the training phase are linearly independent, there can be
no post reduction of the number of support vectors. However, the low number of
support vectors in comparison to an unreduced support vector machine will lead to
reductions in time in the testing phase in the evaluation of equation (7) or equation
(11).

The problem of online learning can be solved by calculating the distance from
any new vector to the vectors in the linearly independent set. These vectors are the
closest to the boundary and should be the closest to any new vectors. If the newly
calculated distance is smaller than a previous distance then the new vector is added
to the independent set and the vector with the largest distance can be dropped from
the set. The SVM will then need to be retrained with the new independent set.
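
One possible reading of this update rule is sketched below in Python. It interprets "a previous distance" as the largest distance currently stored in the independent set, which is an assumption, and it does not re-check linear independence after the swap. The function names, the linear kernel and the sample values are likewise hypothetical.

```python
def linear_kernel(u, v):
    """Plain inner product; any valid kernel K could be substituted."""
    return sum(a * b for a, b in zip(u, v))


def kernel_distance_sq(u, v, kernel=linear_kernel):
    """|v - u|^2 = K(u, u) + K(v, v) - 2 K(v, u)."""
    return kernel(u, u) + kernel(v, v) - 2.0 * kernel(v, u)


def online_update(independent, new_x, new_y, kernel=linear_kernel):
    """independent: list of (vector, label, distance) tuples.

    Returns (updated_list, changed). When changed is True the SVM
    has to be retrained on the updated list.
    """
    opposite = [kernel_distance_sq(new_x, x, kernel)
                for x, y, _ in independent if y != new_y]
    if not opposite:
        return independent, False
    new_dist = min(opposite)
    # Index of the stored vector that is farthest from the opposite class.
    worst = max(range(len(independent)), key=lambda i: independent[i][2])
    if new_dist < independent[worst][2]:
        updated = [item for i, item in enumerate(independent) if i != worst]
        updated.append((new_x, new_y, new_dist))
        return updated, True
    return independent, False


# Hypothetical independent set with stored squared distances.
current = [((1.0, 1.0), 1, 5.0), ((-1.0, 0.0), -1, 5.0)]
closer, changed_closer = online_update(current, (0.5, 0.5), 1)
farther, changed_farther = online_update(current, (10.0, 10.0), 1)
```

A new vector far from the boundary leaves the set untouched, so retraining is only triggered when the candidate is closer to the opposite class than a vector already held.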

At this point the SVM is trained as in the literature with the r independent
vectors.

From a practical point of view, an SVM according to a preferred embodiment
of the present invention is implemented by means of a computational device, such as
a personal computer, PDA, or potentially a wireless device such as a mobile phone.
The computational device executes a software product containing instructions for
implementing a method according to the present invention, such as that illustrated in
the flowchart of Figure 5.

Figure 6 depicts a computational device in the form of a conventional personal
computer system 52 which operates as an SVM according to the present invention
while executing a support vector machine computer program. Personal computer
system 52 includes data entry devices in the form of pointing device 60 and keyboard
58 and a data output device in the form of display 56. The data entry and output
devices are coupled to a processing box 54 which includes a central processing unit
70. Central processing unit 70 interfaces with RAM 62, ROM 64 and secondary
storage device 66. Secondary storage device 66 includes an optical and/or magnetic
data storage medium that bears instructions for execution by central processor 70.
The instructions constitute a software product 72 that when executed causes
computer system 52 to operate as a support vector machine and in particular to
implement the reduced support vector training phase method described above with
reference to Figure 5 and equation 16. It will be realised by those skilled in the art
that the programming of software product 72 is straightforward given the method of
the present invention.

The embodiments of the invention described herein are provided for purposes
of explaining the principles thereof, and are not to be considered as limiting or
restricting the invention since many modifications may be made by the exercise of
skill in the art without departing from the scope of the invention.

Dated this 31st day of October 2003
The University of Queensland
by my attorneys
Eager Newcomb & Buck

Figure 1: Training Phase. Flowchart: START; receive elements, $x_i$, of a training set with a pre-assigned class, $y_i$; transform input data vectors by mapping into a multi-dimensional space; determine parameters of an optimal multi-dimensional hyperplane; END.

Figure 2: Class (1) and Class (2).

Figure 3: Class (1) and Class (2).

Figure 4: Testing Phase. Flowchart: START; receive elements, $x_i$, of a testing set; transform input data vectors by mapping into a multi-dimensional space using support vectors as parameters in the kernel; generate a classification signal from the decision surface to indicate membership status of each input data vector.

Figure 5: Training Phase. Flowchart: START; receive elements, $x_i$, of a training set with a pre-assigned class, $y_i$; transform input data vectors by mapping into a multi-dimensional space; calculate distance from each element, $x_i$, to the opposite class, $y_i \neq y_j$; find a linearly independent set of training elements such that the sum of the distance from each element to the opposite class is minimised; determine parameters of an optimal multi-dimensional hyperplane; END.

Figure 6